home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Shareware Grab Bag
/
Shareware Grab Bag.iso
/
014
/
qsort3.arc
/
QSORT.DOC
< prev
next >
Wrap
Text File
|
1987-03-12
|
26KB
|
524 lines
QSORT - Version 3.00
Copyright (c) 1985, 1986, 1987 - Ben Baker - All rights reserved
QSORT may be freely copied and distributed. provided that 1) it is
distributed under the name "QSORT," and 2) this documentation file
always accompanies it. Vendors wishing to distribute QSORT
commercially, or with commercial products may contact the author at
the address below for terms.
QSORT is made available under the "share-ware" concept. If you find
this program useful, you are asked to send its author a license fee of
$20 for each machine on which you use it. This will encourage the
development of other useful, affordable tools. Send checks to:
Ben Baker
RR #1, Box 637
E. Alton, Il 62024
Bug reports or suggestions may be sent to the above address or via
FidoNet mail to Fido node # 100/76, Baker's Acre, or by logging into
Baker's Acre BBS at 618-251-2169.
--- INTRODUCTION ---
QSORT was first designed to be a replacement for, and to overcome the
limitations of DOS SORT, but has been enhanced a number of times and
moved to new compilers twice. The current version will sort files
whose size is limited only by available disk space. File name(s) may
be given explicitly or QSORT will sort from standard input to standard
output, and so, may be used in pipes or with redirection. Multiple
key fields may be specified. Binary files with fixed-length records
may be sorted, provided only that keys are ASCII fields.
QSORT tries to be very protective of your data. If QSORT has an error
of any kind, it will terminate with the input file still intact, and
will return to DOS with ERRORLEVEL = 1. When QSORT successfully
completes a sort, it terminates with ERRORLEVEL set to 0. (See the
batch commands in your DOS manual for more information on ERRORLEVEL.)
The command line syntax is a super-set of SORT's, so QSORT may be used
without other changes in batch files using SORT, but in most cases you
will probably want to make use of QSORT's greater capabilities.
--- The QSORT Command and Options ---
QSORT is invoked with the following command:
QSORT [<in_file> [<out_file>]] [/<key_spec>. . .]
[/T[<tag>] | /F<len>] [/R] [/S] [/?]
Note that all parameters on the command line are optional.
The <in_file> and <out_file> parameters are "ASCII-Z" file specifiers.
They may contain disk and path information in the standard DOS format,
but must not contain "wild-card" characters.
If <in_file> is missing, QSORT sorts from standard input to standard
output. These are files defined and opened by DOS before QSORT is
loaded. (See your DOS manual concerning the use of redirection and
pipes.)
If <in_file> is given but <out_file> is missing, QSORT creates a
temporary file in the directory containing <in_file> and sorts to the
temporary file. On successful completion of the sort, <in_file> is
deleted and the temporary is renamed to <in_file>. The effect is an
apparent "sort-in-place."
If both file names are given, <in_file> is unchanged and the sorted
output is written to <out_file>. This feature was added in version
3.00 as a convenience. New DOS users sometimes have difficulty with
the concept of redirection. Note that the following two commands are
exactly equivalent:
QSORT FILE.TXT FILE.SRT
QSORT <FILE.TXT >FILE.SRT
In the first, QSORT opens the files. In the second, redirection is
specified and DOS opens the files. The result is the same. It is an
error QSORT can't detect if you mix these. For instance:
QSORT FILE.TXT >FILE.SRT
will result in a sort-in-place. QSORT will open FILE.TXT but won't
know DOS has opened FILE.SRT for it, and will ignore it.
The /<key_spec> Parameter
Up to 30 /<key_spec> parameters may be used to specify key fields and
are ordered major to minor from left to right. The /<key_spec>
argument has the form:
/[L][+|-][col][:width]
Note that all elements of this argument are "optional," but at least
one element must be present following the slant-bar (/). The 'L', if
present, specifies "lexical" sequence for this field. (See discussion
of lexical sequence below.) The minus (-) sign specifies reverse order
for this field. The plus (+) sign (or no sign) specifies normal sort
order. If present, [col] defines the beginning column of the field.
If omitted, column 1 is assumed. If present, [:width] defines the
field width in columns. If width is omitted, the rest of the record
is assumed to be part of the key. If no key field parameters are
given, the entire record is the key field. When sorting
variable-length records, any key field which falls beyond the end of a
particular record is treated as a null (zero length) field for that
record, and collates low. When sorting fixed-length records, all
defined fields must fall within the defined record length.
The /F<len> Parameter
The /F<len> parameter defines the record length for a file of
fixed-length records. All records in the input file MUST be exactly
<len> bytes long. The records need not (but may) be terminated with a
CR/LF sequence. They may contain any data, even binary data, but the
key fields must be ASCII strings. Strings may be terminated with a
null (binary zero) character, or may be padded with trailing spaces to
the full length of the field. Note that QSORT does not attempt to
support Pascal style strings. These are strings which begin with a
character whose binary value is a character count. This is followed
by <count> characters of ASCII data, which in turn is followed by
random data out to the maximum length of the string. These strings
may be used as key fields, but the programmer must insure that either
the last real character is a null character, or the field is padded to
its full length with spaces. QSORT must be told that the key begins
in the second character position (the first character of real data).
The /T[<tag>] Parameter
The /T[<tag>] parameter, if present, indicates that the "records" to
be sorted may be more than a single line long. If <tag> is also
present, it defines a character to be used to tag the "end-of-record."
If <tag> is not present, the first empty line terminates the record.
For this purpose, "empty" means "no characters." A line containing but
a single space is NOT empty! A line may be "tagged" by placing the
<tag> character anywhere on the last line of a logical record. The
entire line, including the tag character will appear as the last line
of the record.
Note that the /F<len> and /T[<tag>] parameters are incompatible, and
may not both be specified.
The /R Parameter
The /R parameter is included for compatibility with DOS SORT and is
redundant. It reverses the sense of sort direction for all key
fields.
The /S Parameter
The /S parameter tells QSORT to make a statistics report to the screen
at the end of a run. The report is written to the "standard error"
device, the console, and may not be redirected. The following is an
actual statistics report "cut" from the screen after QSORT had sorted
a 1 mega-byte+ file:
QSORT - Text Sort - Version 3.00
Copyright (c) 1985, 1986, 1987 - Ben Baker - All rights reserved
10131 records sorted
146 bytes in longest record
116744 sort phase comparisons
53771 merge phase comparisons
170515 total comparisons
16.8 comparisons per input record
21 temporary merge files created
2 merge passes
2.3 average passes over data
6:00 elapsed time
The first two numbers are self-explanatory. The next two are the
number of times two records were compared during the sort phase and
the merge phase respectively, followed by the total comparisons.
The next number is total comparisons divided by the number of records
in the input file. If there is no merge phase, this number is
typically 10 to 12. If the file is large enough to require merging,
it is 12 to 20, on average. If it is much larger than 20, it usually
means that there is something unusual about your input file. It may
already be sorted, or there may be large blocks of records which
compare equal. This can happen if you sort on, say column 50 and the
input file contains a large number of records shorter than 50 bytes.
In this case, a minor sort key at column 1 may significantly speed
sorting.
The next two items are self-explanatory. "Average passes over data"
reflects the number of times each record was read and written. For
short files not requiring a merge pass, this number will be 1.0. When
merging is needed, the last merge pass is the one which writes the
output file and it must read and write every record exactly once.
Thus when only one merge pass is made, there will be exactly 2.0
"average passes over data." In the above case the first merge pass
processed about 30% of the records, hence the value of 2.3.
The above sort was performed on an IBM XT with two hard drives (C and
D). The input file was on C; the temporary merge files were placed
on D; and the output file was written to C. The sort of a 1
mega-byte file took six minutes flat. With my Tiny Turbo accelerator
turned on, the same sort takes about 3:50.
The /? Parameter
The /? parameter requests help or parameter evaluation. When QSORT
is executed with the /? parameter alone, it lists a short description
of the QSORT parameters. If /? is entered as one of several
parameters, QSORT will produce a short report on the screen describing
the sort it would perform based on those parameters without actually
doing a sort.
For example:
QSORT /? /L5:12 /-3:2 /22 /T /R <INFILE.TXT >OUTFILE.TXT
requests a sort on a file of tagged records. It defines three key
fields, one of them inverted (-). The /R parameter reverses the sense
of all three fields. Since redirection is specified, QSORT does not
see or know about the names of the files it will sort. The /?
parameter requests a report, rather than a sort, and QSORT obligingly
produces on the screen, the following:
QSORT - Text Sort - Version 3.00
Copyright (c) 1985, 1986, 1987 - Ben Baker - All rights reserved
With the present arguments, QSORT would sort from STDIN to STDOUT
Records are multiple lines ending with an empty line
Key fields in descending order of importance are:
Pos Len Type
5 12 Lexical Descending
3 2 ASCII
22 65535 ASCII Descending
The third field has an unspecified length. The value "65535" merely
means that this field extends to the end of each record.
The /M<len> supported in earlier versions of QSORT is no longer
required, but will be accepted (and ignored) by QSORT. There is no
"hard-coded" maximum record length in QSORT, but there is a practical
limit. At some time during every sort, the two longest records in the
input file must be compared. Therefore, the two longest records must
be able to fit together in the sort buffer. The sum of their lengths
cannot exceed about 50K -- not an altogether unreasonable limitation.
QSORT can be shoe-horned into tighter memory and will run if it can
find 4K for a sort buffer and 4K for an output buffer, but the two
longest records must still fit in the sort buffer together.
Arguments may appear in any order on the command line except that
<in_file> must appear before <out_file>, and /<key_spec> arguments
must appear in descending order of importance.
--- Lexical Sorting ---
The lexical sorting capability was borne out of my own need to sort
word lists with mixed capitalization. ASCII sequence produced some
bizarre results when words beginning with 'Z' sorted before those
beginning with 'a.' Case-insensitive sorting wasn't much better
because upper and lower case got mixed randomly.
The following table will illustrate what I mean:
INPUT ASCII CASE LEXICAL
INSENSITIVE
DeLaPort Baker Baker Baker
Smith Brown brown Brown
brown DeAngelo bRown bRown
deLaPorte DeLaPort Brown brown
Deangelo Deangelo Deangelo DeAngelo
deAngelo Deangelo deangelo Deangelo
Brown DelaPort Deangelo Deangelo
smith DelaPorte deAngelo deAngelo
delaPorte Harry DeAngelo deangelo
DelaPort Smith delaPort DeLaPort
DeAngelo bRown DelaPort DelaPort
DelaPorte brown delaPort delaPort
deangelo deAngelo DeLaPort delaPort
Harry deLaPorte DelaPorte DelaPorte
delaPort deLaPorte deLaPorte deLaPorte
Baker deangelo delaPorte deLaPorte
deLaPorte delaPort deLaPorte delaPorte
Deangelo delaPort Harry Harry
bRown delaPorte smith Smith
delaPort smith Smith smith
The first column is a list of names in arbitrary order. The second is
an ASCII sort of that list. Third we have one possible
case-insensitive sort of the list. The fourth column is what I really
wanted. It is sorted the way these words would be sorted in a
dictionary (or lexicon). The third and fourth columns both collect
words of identical spellings together, but in the third column, upper
and lower case spellings are in arbitrary order, while the fourth
column places upper case spellings ahead of lower case spellings.
For example, the two occurrences of Smith are widely separated in
column 2 because one is capitalized and the other is not. Column 3
brings the two together, but in the wrong order. They might have been
in the right order, but the order is strictly arbitrary. In column 4,
Smith comes before smith, and lexical sorting will always put them in
this order.
Lexical sorting is achieved by making case-insensitive comparisions of
entire fields. If the fields compare equal, an ASCII comparison is
made to arbitrate ties. In other words, when "lexical" fields in two
records have different spellings, the case-insensitive comparison
determines the order of the records. When "lexical" fields are
spelled the same, the case-sensitive comparison determines the order
of the records.
Lexical fields are defined, as indicated above, by placing the letter
'L' immediately following the slant-bar (/) in <key_spec> definitions.
Lexical sorting can be very useful when needed, but be aware that
unnecessarily specifying lexical ordering may degrade performance of
QSORT.
--- Examples ---
Produce a sorted directory listing and display it on the console a
screen's worth at a time:
A>DIR | QSORT | MORE
This demonstrates the use of QSORT as a "filter" in a "pipe." Produce
a directory listing sorted by creation date and time, and display it
on the console a screen's worth at a time:
A>DIR | QSORT /30:2 /24:5 /39 /34:5 | MORE
The output of the DIR command is piped to QSORT. The key fields
defined are, from left to right (major to minor), year (2 digits),
month and day, AM/PM flag and time. The output of QSORT is then piped
to MORE for display.
Next, replace the unsorted FILE.TXT with the same data sorted in
reverse order. Use columns 10 to 16 as the sort key:
A>QSORT FILE.TXT /-10:7
or
A>QSORT FILE.TXT /10:7 /R
or (SORT compatible)
A>QSORT FILE.TXT /R /+10
Next, perform a simple sort on a file with up to 240-byte records
A>QSORT LARGE.REC /M240
or
A>QSORT LARGE.REC
Note that the "/M240" parameter is no longer needed, but will not
hurt.
GLOSS.TXT is an unsorted glossary of terms. The term being defined by
each entry appears first, followed by several lines of definition.
The entries are separated by empty lines. Produce GLOSS.SRT, a sorted
version of the glossary:
A>QSORT /T <GLOSS.TXT >GLOSS.SRT (with redirection)
or
A>QSORT /T GLOSS.TXT GLOSS.SRT (without redirection)
A lawyer keeps a running log of his billable activities in TIME.LOG.
The first line of each entry is "mm/dd/yy hh:mm account#." He always
places a tilde (~) in the last line of each entry. He wishes to sort
the log by account number, and by ascending date and time within each
account:
QSORT /16:7 /7:2 /1:5 /10:5 /T~ TIME.LOG
The directory of users for a bulletin board system is kept in a binary
file of fixed-length records 180 bytes long. The user name is a
26-character field beginning in the first position and the city/state
field is a 16-character field beginning in the fortieth position.
Sort the file by city/state and name.
QSORT /F180 /40:16 /1:26 USER.BBS
--- Implementation Notes ---
QSORT is intended as an enhanced replacement for DOS SORT. It is
nearly fully upward compatible, but provides much more flexibility.
Multiple sort keys may be specified, a pseudo in-place sort may be
performed and files and/or records of any size may be sorted provided
only that there is sufficient disk space for work files and the output
file. QSORT uses the "quick sort" algorithm, which cannot guarantee
the order of records whose keys are all equal. This is the one
"incompatibility" with DOS SORT, which retains the original order of
records when its only key compares equal. This is important to SORT
because it must be invoked multiple times to effect a multiple key
sort. With QSORT, you only sort once and there are usually enough
keys available to insure you get the order you want the first time.
QSORT uses a sort buffer of about 50K bytes and will fill the buffer
as full as possible, and then sort the contents of the buffer. If it
has reached the end of the input file and has not yet generated any
work files, the contents of the buffer are written to the output file,
completing the sort operation.
If the input file is too large to fit into the sort buffer, as much of
the input file as possible is read into the buffer, sorted, then
written to a temporary work file. This process is repeated as many
times as necessary to process the entire input file, each time
creating a new work file for the sorted output.
Upon completion of the "sort phase," QSORT begins a "merge phase."
Each work file is a sorted sub-set of the input file. Thus, work
files may be read sequentially and combined to produce a sorted
output. QSORT will open as many work files as DOS permits (more on
this later). If all the remaining work files can be opened, the
sorted result is written to the output file. Otherwise, a new work
file is created and another merge pass will be required. On each
merge pass, the number of work files is reduced and eventually all
remaining work files will be opened and the sorted output file will be
written completing the sort operation.
QSORT is smart enough to never have just one work file remaining,
which would require an unnecessary copy operation. In fact, QSORT is
smarter than just that in its handling of the merge phase. If more
than one merge pass is required, all the data merged during the first
pass will have to be merged again, so QSORT attempts to minimize the
first pass. For example, if QSORT discovers it may only open 15 files
at a time, and there are 16 temporary files, it will only merge two
files on the first pass, creating a 17th file as it does. Then in the
second pass, it will merge all 15 remaining files to the output file.
The less data it processes twice, the faster it performs the sort!
With nothing else to guide it, QSORT places its temporary files in the
default directory. Either of two "environment variables" can override
this. (See your DOS manual for information on environment variables
and the SET command.) The command:
SET QSTMP=<path> or
SET TMP=<path>
will define a path for QSORT to use for its temporaries. QSORT first
looks for the environment variable QSTMP. If it does not exist, QSORT
next looks for TMP. TMP is a de facto standard used by many programs,
and is usually defined in your AUTOEXEC.BAT batch file. You might
have TMP specifying a 64K RAM disk to speed up your compiler. In this
case, an attempt to sort a 100K file is doomed to failure. Rather
than redefine TMP, you may define QSTMP to force QSORT to use some
directory on your hard disk. In fact:
SET QSTMP=.
tells QSORT to always use the default directory anyway!
QSORT, to work properly, needs enough space on the output disk to hold
the output file. Even if the input file is to be deleted and resides
in the same directory, that is not done until after the output file
has been successfully written. If one merge pass is required, the
disk QSORT uses for temporaries should have about 10% more space than
the size of the input file for temporary merge files. If more than
one merge pass will be required, allow about twice the size of the
input file as temporary merge file space.
One of the advantages of controlling where QSORT places its temporary
files is to insure adequate space for them. A second is speed. If
the temporary files can be placed on a separate disk from the input
and output files, disk seeking is minimized and performance improved.
Each time QSORT must create a new temporary merge file, the data put
into it will be processed again. Obviously, the more files QSORT can
open during the merge phase, the fewer times it will have to handle
each record and the faster it can sort large files. If DOS is
properly pre-conditioned, QSORT can have up to 15 temporary merge
files open at once, and very large files can be sorted with just one
sort pass and one merge pass. Unfortunately, that capability is not
automatic.
DOS has a fixed number of file "handles" that it associates with open
files. The default number is eight, but DOS opens five of them for
standard input, standard output, standard error, standard printer and
standard auxiliary device. That leaves three for merging. A 250K
input file would produce five temporary merge files and that would
take three merge passes; merge two into one, leaving four; merge two
into one leaving three; and finally merge three into the output file.
In the process, QSORT must read and write about 80% of the file twice
during the merge phase.
Worse yet, since you need at least three handles for merging, if you
have resident programs that have open files, you can't merge at all!
DOS can be told to set aside more space for file handles. Each handle
is only 39 bytes and it's memory very well spent. One process can
have a maximum of 20 handles open at one time, but since resident
processes may be using handles, I recommend 25 to 35. To do this, the
root directory of the DISK OR DISKS YOU BOOT FROM must contain a file
named CONFIG.SYS. If your boot disk(s) already contains a CONFIG.SYS,
edit it, or if not, create it to contain the following line:
FILES=25 (or more)
While we're at it, let's add one more thing to CONFIG.SYS which will
improve the performance of QSORT and many other programs as well. DOS
provides, by default, two disk buffers. These are the buffers it uses
to do its disk reads and writes. During the merge phase QSORT may
have many files open at once, reading from them in more or less random
order. DOS may have to read the same physical sector several times to
get all its data. But DOS can remember what's in each buffer and
where it came from, and will not re-read a sector it already has in a
buffer. DOS needs 528 bytes for each buffer. I recommend 20 buffers
to make QSORT perform well under the most adverse conditions. This
will require an additional 9504 bytes or slightly more than 9K, again
memory well spent, so we add to CONFIG.SYS the following line:
BUFFERS=20
See your DOS manual for more information on CONFIG.SYS.
I hope you find this program useful. Your comments and suggestions
are welcome. My address is at the top of this file.